Welcome to the Apps project! To give you a taste of your future career, we're going to walk through exactly the kind of notebook that you'd write as a data scientist. In the process, we'll be sure to signpost the general framework for our investigation - the Data Science Pipeline - as well as give reasons for why we're doing what we're doing. We're also going to apply some of the skills and knowledge you've built up in the previous unit when reading Professor Spiegelhalter's The Art of Statistics (hereinafter AoS).
So let's get cracking!
Brief
Did Apple Store apps receive better reviews than Google Play apps?
Along the way, we'll add a `platform` column to both the Apple and the Google dataframes, eliminate `NaN` values, and summarize the ratings grouped by `platform`.
In this case we are going to import pandas, numpy, scipy, random, and matplotlib.pyplot.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# scipy is a library for statistical tests and visualizations
from scipy import stats
# random enables us to generate random numbers
import random
Let's download the data from Kaggle. Kaggle is a fantastic resource: a kind of social network for data scientists, it boasts projects, datasets and news on the freshest libraries and technologies all in one place. The data from the Apple Store can be found here and the data from the Google Play Store can be found here. Download the datasets and save them in your working directory.
# Now that the files are saved, we want to load them into Python using read_csv and pandas.
# Create a variable called google, and store in it the path of the csv file that contains your google dataset.
# If your dataset is in the same folder as this notebook, the path will simply be the name of the file.
google = 'googleplaystore.csv'
# Read the csv file into a data frame called Google using the read_csv() pandas method.
Google = pd.read_csv(google)
# Using the head() pandas method, observe the first three entries.
Google.head(3)
# Create a variable called apple, and store in it the path of the csv file that contains your apple dataset.
apple = 'AppleStore.csv'
# Read the csv file into a pandas DataFrame object called Apple.
Apple = pd.read_csv(apple)
# Observe the first three entries like you did with your other data.
Apple.head(3)
From the documentation of these datasets, we can infer that the most appropriate columns to answer the brief are:
For the Google dataset:
- `Category` # Do we need this?
- `Rating`
- `Reviews`
- `Price` (maybe)

For the Apple dataset:
- `prime_genre` # Do we need this?
- `user_rating`
- `rating_count_tot`
- `price` (maybe)

Let's select only those columns that we want to work with from both datasets. We'll overwrite the original variables with these subsets.
# Subset our DataFrame object Google by selecting just the variables ['Category', 'Rating', 'Reviews', 'Price']
Google = Google[['Category', 'Rating', 'Reviews', 'Price']]
# Check the first three entries
Google.head(3)
# Do the same with our Apple object, selecting just the variables ['prime_genre', 'user_rating', 'rating_count_tot', 'price']
Apple = Apple[['prime_genre', 'user_rating', 'rating_count_tot', 'price']]
# Let's check the first three entries
Apple.head(3)
Types are crucial for data science in Python. Let's determine whether the variables we selected in the previous section have the types they should, or whether there are any errors here.
# Using the dtypes feature of pandas DataFrame objects, check out the data types within our Apple dataframe.
# Are they what you expect?
Apple.dtypes
This is looking healthy. But what about our Google data frame?
# Using the same dtypes feature, check out the data types of our Google dataframe.
Google.dtypes
Weird. The data type for the column 'Price' is 'object', not a numeric data type like a float or an integer. Let's investigate the unique values of this column.
# Use the unique() pandas method on the Price column to check its unique values.
Google.Price.unique()
Aha! Fascinating. There are actually two issues here:
- Some rows have the value 'Everyone' in the 'Price' column. That is a massive mistake!
- The prices contain dollar symbols, so pandas is storing them as strings rather than numbers.

Let's address the first issue first: let's check the data points that have the price value 'Everyone'.
# Let's check which data points have the value 'Everyone' for the 'Price' column by subsetting our Google dataframe.
# Subset the Google dataframe on the price column.
# To be sure: you want to pick out just those rows whose value for the 'Price' column is just 'Everyone'.
Google[Google.Price == 'Everyone']
Thankfully, it's just one row. We've gotta get rid of it.
# Let's eliminate that row.
# Subset our Google dataframe to pick out just those rows whose value for the 'Price' column is NOT 'Everyone'.
# Reassign that subset to the Google variable.
# You can do this in two lines or one. Your choice!
Google = Google[Google.Price != 'Everyone']
# Check again the unique values of Google
Google.Price.unique()
Our second problem remains: I'm seeing dollar symbols when I close my eyes! (And not in a good way).
This is a problem because Python actually considers these values strings. So we can't do mathematical and statistical operations on them until we've made them into numbers.
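To see why this matters, here's a tiny toy example (not from our dataset) of what happens when you do arithmetic on a string price:
price = '$4.99'
# Multiplying a string just repeats it — not what we want
print(price * 2)  # '$4.99$4.99'
# After stripping the symbol and converting, arithmetic works as expected
print(float(price.replace('$', '')) * 2)  # 9.98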
# Let's create a variable called nosymb.
# This variable will take the Price column of Google and apply the str.replace() method.
nosymb = Google.Price.str.replace('$', '', regex=False)
# Now we need to do two things:
# i. Make the values in the nosymb variable numeric using the to_numeric() pandas method.
# ii. Assign this new set of numeric, dollar-sign-less values to Google['Price'].
# You can do this in one line if you wish.
Google.Price = pd.to_numeric(nosymb)
nosymb.unique()
Now let's check the data types for our Google dataframe again, to verify that the 'Price' column really is numeric now.
# Use the function dtypes.
Google.dtypes
Notice that the `Reviews` column is still an object column. We actually need this column to be a numeric column, too.
# Convert the 'Reviews' column to a numeric data type.
Google['Reviews'] = pd.to_numeric(Google.Reviews)
# Let's check the data types of Google again
Google.dtypes
Add a `platform` column to both the Apple and the Google dataframes

Let's add a new column to both dataframe objects called `platform`: all of its values in the Google dataframe will be just 'Google', and all of its values in the Apple dataframe will be just 'Apple'.
The reason we're making this column is so that we can ultimately join our Apple and Google data together, and actually test out some hypotheses to solve the problem in our brief.
# Create a column called 'platform' in both the Apple and Google dataframes.
# Add the value 'apple' and the value 'google' as appropriate.
Apple['platform'] = 'Apple'
Google['platform'] = 'Google'
Since the easiest way to join two datasets is if they both have the same column names, we need to rename the columns of Apple so that they're the same as the ones of Google, or vice versa. In this case, we're going to change the Apple column names to the names of the Google columns.

This is an important step to unify the two datasets!
# Create a variable called old_names where you'll store the column names of the Apple dataframe.
# Use the feature .columns.
old_names = Apple.columns
# Create a variable called new_names where you'll store the column names of the Google dataframe.
new_names = Google.columns
# Use the rename() DataFrame method to change the column names.
Apple = Apple.rename(columns = dict(zip(old_names,new_names)))
# Check the new column names
Apple
Let's combine the two datasets into a single data frame. For simplicity, we'll overwrite our Google variable with the combined data.
# Let's use pd.concat() to append Apple to Google.
# (DataFrame.append is deprecated in recent versions of pandas, so we use pd.concat instead.)
Google = pd.concat([Google, Apple])
# Using the sample() method with the number 12 passed to it, check 12 random points of your dataset.
Google.sample(12)
As you can see, there are some `NaN` values. We want to eliminate all these `NaN` values from the table.
# Let's first check the dimensions of our dataframe before dropping `NaN` values. Use the .shape feature.
print(Google.shape)
# Use the dropna() method to eliminate all the NaN values, and overwrite the same dataframe with the result.
Google.dropna(inplace=True)
# Check the new dimensions of our dataframe.
print(Google.shape)
Apps that haven't been reviewed yet can't help us solve our brief.
So let's check to see if any apps have no reviews at all.
# Subset your df to pick out just those rows whose value for 'Reviews' is equal to 0.
# Do a count() on the result.
df = Google[Google.Reviews == 0]
df.count()
929 apps have no reviews; we need to eliminate these data points!
# Eliminate the points that have 0 reviews.
Google = Google[Google.Reviews != 0]
Summarize the data (grouped by `platform`)

What we need to solve our brief is a summary of the `Rating` column, but separated by the different platforms.
# To summarize analytically, let's use the groupby() method on our df.
Google.groupby('platform').mean(numeric_only=True)
Interesting! Our means of 4.049697 and 4.191757 don't seem all that different! Perhaps we've solved our brief already: there's no significant difference between Google Play app reviews and Apple Store app reviews. We have an observed difference here: it is simply (4.191757 - 4.049697) = 0.14206. This is just the actual difference that we observed between the mean rating for apps from Google Play and the mean rating for apps from the Apple Store. Let's look at how we're going to use this observed difference to solve our problem using a statistical test.
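We can also compute this observed difference directly from the grouped means. Here's a minimal sketch, assuming the 'Apple' and 'Google' platform labels we created above (the variable names are just for illustration):
# Compute the observed difference in mean ratings from the grouped means
mean_ratings = Google.groupby('platform')['Rating'].mean()
observed_difference = mean_ratings.loc['Google'] - mean_ratings.loc['Apple']
print(observed_difference)  # should be roughly 0.14206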
Outline of our method:
1. State a null hypothesis and an alternative hypothesis, and pick a significance level.
2. Check the distribution of the data so we can choose an appropriate statistical test.
3. Build the distribution of the difference in mean ratings under the null hypothesis by repeatedly permuting the data.
4. Compare our observed difference against that null distribution to get a p-value, and accept or reject the null hypothesis accordingly.
If you want to look more deeply at the statistics behind this project, check out this resource.
Let's also get a visual summary of the `Rating` column, separated by the different platforms.
A good tool to use here is the boxplot!
# Call the boxplot() method on our df.
Google.boxplot(by='platform',column ='Rating')
plt.tight_layout();
Here we see the same information as in the analytical summary, but with a boxplot. Can you see how the boxplot is working here? If you need to revise your boxplots, check out this link.
Our Null hypothesis is just:
Hnull: the observed difference in the mean rating of Apple Store and Google Play apps is due to chance (and thus not due to the platform).
The more interesting hypothesis is called the Alternate hypothesis:
Halternative: the observed difference in the average ratings of Apple and Google users is not due to chance (and is actually due to the platform).
We're also going to pick a significance level of 0.05.
Now that the hypotheses and significance level are defined, we can select a statistical test to determine which hypothesis to accept.
There are many different statistical tests, all with different assumptions. You'll develop excellent judgement about when to use which statistical test over the course of the Data Science Career Track. But in general, one of the most important things to determine is the distribution of the data.
# Create a subset of the column 'Rating' by the different platforms.
# Call the subsets 'apple' and 'google'
apple = Google[Google.platform == 'Apple']['Rating']
google = Google[Google.platform == 'Google']['Rating']
# Using the stats.normaltest() method, get an indication of whether the apple data are normally distributed
# Save the result in a variable called apple_normal, and print it out
apple_normal = stats.normaltest(apple)
apple_normal
# Do the same with the google data.
google_normal = stats.normaltest(google)
google_normal
Since the null hypothesis of normaltest() is that the data are normally distributed, the lower the p-value in the result of this test, the stronger the evidence that the data are non-normal.

Since the p-value is (effectively) 0 for both tests, regardless of what we pick for the significance level, our conclusion is that the data are not normally distributed.
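As a quick check, we can compare these p-values against our 0.05 significance level; the result objects returned by stats.normaltest() expose a pvalue attribute:
# Compare the normality-test p-values against our significance level
alpha = 0.05
print(apple_normal.pvalue < alpha)   # True -> reject normality for the apple data
print(google_normal.pvalue < alpha)  # True -> reject normality for the google data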
We can also check out the distribution of the data visually with a histogram. A normal distribution has the following visual characteristics:
- symmetric
- unimodal (one hump)
- roughly identical mean, median and mode
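Before plotting, here's a quick numeric spot-check of that last point, using the apple and google subsets we created above:
# Spot-check how close the mean and median are for each platform's ratings
print(apple.mean(), apple.median())
print(google.mean(), google.median())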
# Create a histogram of the apple reviews distribution
plt.hist(apple);
# Create a histogram of the google data
plt.hist(google);
Since the data aren't normally distributed, we're using a non-parametric test here. This is simply a label for statistical tests used when we can't assume a particular distribution for the data. These tests are extraordinarily flexible because of how few assumptions we need to make.
Check out more about permutations here.
# Create a column called `Permutation1`, and assign to it the result of permuting (shuffling) the Rating column
# This assignment will use our numpy object's random.permutation() method
Google['Permutation1'] = np.random.permutation(Google.Rating)
# Call the describe() method on our permutation grouped by 'platform'.
Google.groupby('platform')[['Permutation1']].describe()
# Let's compare with the previous analytical summary:
Google.groupby('platform')[['Rating','Permutation1']].describe().T
# Check the mean of the permuted ratings by platform
Google.groupby('platform')[['Permutation1']].mean()
# Set a random seed so the permutations below are reproducible
np.random.seed(42)
# The difference in the means for Permutation1 (0.001103) now looks hugely different from our observed difference of 0.14206.
# It's sure starting to look like our observed difference is significant, and that the Null is false; platform does impact ratings.
# But to be sure, let's create 10,000 permutations, calculate the mean ratings for Google and Apple apps and the difference between these for each one, and then take the average of all of these differences.
# Let's create a vector with the differences - that will be the distribution of the Null.
# First, make an array called difference.
difference = np.empty(10000)
# Now make a for loop that does the following 10,000 times:
# 1. makes a permutation of the 'Rating' as you did above
# 2. calculates the difference in the mean rating for apple and the mean rating for google.
for i in range(10000):
    # Shuffle the ratings, then take the difference between the permuted platform means
    Google['Permutation1'] = np.random.permutation(Google.Rating)
    permuted_means = Google.groupby('platform')['Permutation1'].mean()
    difference[i] = permuted_means.loc['Apple'] - permuted_means.loc['Google']
# Make a variable called 'histo', and assign to it the result of plotting a histogram of the difference list.
histo = plt.hist(difference)
# Now make a variable called obs_difference, and assign it the difference between the mean of our 'apple' variable and the mean of our 'google' variable.
obs_difference = np.mean(apple) - np.mean(google)
# Make this difference absolute with the built-in abs() function.
obs_difference = abs(obs_difference)
# Print out this value; it should be 0.1420605474512291.
print(obs_difference)
# Another way, per the DataCamp course (section 11.3):
np.random.seed(42)
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))
    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)
    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1) - np.mean(data_2)
    return diff

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""
    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)
    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)
    return perm_replicates
# Compute the observed difference of mean ratings: empirical_diff_means
empirical_diff_means = diff_of_means(apple, google)
# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(apple, google, diff_of_means, size=10000)
# Compute the (two-sided) p-value, comparing absolute differences so the direction of the difference doesn't matter: p
p = np.sum(np.abs(perm_replicates) >= abs(empirical_diff_means)) / len(perm_replicates)
# Print the result
print('p-value =', p)
histo = plt.hist(perm_replicates)
What do we know?

Recall: the p-value of our observed data is just the proportion of the data given the null that's at least as extreme as that observed data.

As a result, we're going to count how many of the differences in our difference array are at least as extreme as our observed difference. If 5% or fewer of them are, then we will reject the Null.
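Here's a minimal sketch of that count, using the difference array and obs_difference from above (two-sided, since we took the absolute observed difference):
# Count how many permuted differences are at least as extreme as the observed one
extreme_count = np.sum(np.abs(difference) >= obs_difference)
print('extreme differences:', extreme_count)
print('p-value =', extreme_count / len(difference))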
So actually, zero differences are at least as extreme as our observed difference!
So the p-value of our observed data is 0.
It doesn't matter which significance level we pick; our observed data is statistically significant, and we reject the Null.
We conclude that platform does impact ratings. Specifically, we should advise our client to integrate only Google Play into their operating system interface.
The test we used here is the Permutation test. This was appropriate because our data were not normally distributed!
As we've seen in Professor Spiegelhalter's book, there are actually many different statistical tests, all with different assumptions. How many of these different statistical tests can you remember? How much do you remember about what the appropriate conditions are under which to use them?
Make a note of your answers to these questions, and discuss them with your mentor at your next call.